1 Abstract

Researches on crimes can be “person-based” and “location-based”. In this study, I explored the relationship between demographic (person-based), street condition (location-based) and felonies’ amount. We collected data for the 2165 census tracts in New York City, and used a set of data mining regression and classification techniques. After analyzed the result to determine which is the best for explain how the demographic and street condition affect crime, I proposed a best approach which classifies the crime hierarchy of 2165 census tracts correctly more than 70% based on built environment and demographics.

2 Setup

Scripts to load neccessary packages, set up plot style and show numbers in a non-scientific notation way.

library(tidyverse)
library(grid)
library(dplyr)
library(sf)
library('RSocrata')
library(viridis)
library(caret)
library(data.table)
library(ggplot2)
library(ridge)
library(dummies)
library(caret)
library(forcats) 
library(BAMMtools)
library(rpart.plot)
library(ggpubr)
library(knitr)
options(scipen=999)
setwd("/Users/hemingyi/Documents/musa507/Felony/")
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

3 Data

3.1 Final Data Set

CensusCode TotalPop Men Women Hispanic White Black Native Asian Citizen Income Poverty Unemployment ResidencePrice StreeLightAmount TreeAmount Salesin3Years FelonyAmount Bronx Brooklyn Manhattan Queens Staten.Island
000100Brooklyn 4304 2196 2108 9.8 70.4 10.5 0 5.8 3501 70938 15.2 4.1 0 7 405 0 486 0 1 0 0 0
000100Queens 6403 3322 3081 6.0 61.5 5.5 0 21.1 4126 127807 11.4 3.5 0 10 436 0 881 0 0 0 1 0
000200Bronx 5403 2659 2744 75.8 2.3 16.0 0 4.2 3639 72034 20.0 7.7 0 2 377 0 625 1 0 0 0 0
000200Brooklyn 1868 1031 837 86.9 6.3 6.0 0 0.5 618 34740 39.8 7.9 0 2 150 0 731 0 1 0 0 0
000200Queens 3364 1699 1665 55.3 17.5 1.0 0 23.0 2067 64242 14.2 8.2 0 1 220 0 532 0 0 0 1 0
000201Manhattan 2791 1301 1490 35.3 12.4 6.2 0 46.1 2031 20521 48.6 2.6 0 2 70 0 635 0 0 1 0 0

3.2 Data Downloading

3.2.1 download NYPD Complaint Data Historic

FelonyData <- fread('https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD')

3.2.2 download nyc census tract boundary

CensusTractBoundary <- 
  st_read("https://data.cityofnewyork.us/api/geospatial/fxpq-c8ku?method=export&format=GeoJSON") %>%
  st_transform(crs=2263)
## Reading layer `OGRGeoJSON' from data source `https://data.cityofnewyork.us/api/geospatial/fxpq-c8ku?method=export&format=GeoJSON' using driver `GeoJSON'
## Simple feature collection with 2165 features and 11 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -74.25559 ymin: 40.49612 xmax: -73.70001 ymax: 40.91553
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

3.2.3 Download Street Light Location

StreetLight <- fread('https://data.cityofnewyork.us/api/views/tbgj-tdd6/rows.csv?accessType=DOWNLOAD')

3.2.4 Download Trees Location

Trees <- fread('https://data.cityofnewyork.us/api/views/5rq2-4hqu/rows.csv?accessType=DOWNLOAD')

3.2.5 Download Property Sale Records

PropertySales <- read.socrata(
  "https://data.cityofnewyork.us/resource/w2pb-icbu.json?$where=sale_price > 1",
  app_token = "YOUR TOKEN",
  email     = "YOUR EMAIL ADDRESS",
  password  = "YOUR PASSWORD,,"
)

3.2.6 Read Demographics Data, this data set is available at https://www.kaggle.com/muonneutrino/new-york-city-census-data/download

Demographics <- read_csv('new-york-city-census-data//nyc_census_tracts.csv')

3.3 Data Preprocessing

3.3.1 Limit the crime to felony and happend in 2018.

FelonyData2018 <- FelonyData %>%
  mutate(year = unlist(strsplit(CMPLNT_FR_DT, "/"))[3],
         Latitude = as.numeric(Latitude),
         Longitude = as.numeric(Longitude)) %>%
  filter(LAW_CAT_CD == "FELONY", year=="2018", Longitude>-74.1, Longitude<-73, Latitude>40, Latitude<41.1) %>%
  select(Latitude,Longitude,BORO_NM) %>%
  st_as_sf(coords = c('Longitude','Latitude'), crs = 4326, agr = "constant") %>%
  st_transform(crs=2263)

3.3.2 Limit street light type, and add geometry

StreetLight <- StreetLight %>%
  filter(`Pole type` == "Streetlight Pole") %>%
  st_as_sf(coords = c('Longitude','Latitude'), crs = 4326, agr = "constant") %>%
  select(`Pole type`) %>%
  st_transform(crs=2263)

3.3.3 Add tree geometry

Trees <- Trees %>%
  st_as_sf(coords = c('longitude','Latitude'), crs = 4326, agr = "constant") %>%
  select(tree_dbh) %>%
  st_transform(crs=2263)

3.3.4 Drop na value in property sale record, and add geometry

PropertySales <- PropertySales[!is.na(PropertySales$latitude)&!is.na(PropertySales$longitude)&!is.na(PropertySales$tax_class_as_of_final_roll),]
PropertySalesDF <- PropertySales %>%
  mutate(sale_date = as.Date(sale_date)) %>%
  filter(sale_price>0,gross_square_feet>0) %>% #,sale_date>'2016-01-01'
  select(latitude,longitude,sale_price,tax_class_as_of_final_roll,gross_square_feet) 
PropertySalesDF <- st_as_sf(PropertySalesDF, coords = c('longitude','latitude'), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(CensusTractBoundary))

3.3.5 Create a “CensusTractCode” and merge demographics data on demographics boundaries

GenerateCensusCode <- function(x){
  return(substring(x,6,))
}
Demographics <- Demographics %>%
  mutate(CensusTract = as.character(CensusTract),
         CensusCode = apply(Demographics[1], 2, GenerateCensusCode),
         CensusCode = paste(CensusCode,Borough,sep=''))
CensusTractDF <- CensusTractBoundary %>%
  select(boro_name, geometry,ct2010) %>%
  mutate(ct2010=as.character(ct2010),
         CensusCode=paste(ct2010,boro_name,sep=''),
         CensusCode=as.character(CensusCode)) %>%
  merge(Demographics, by.x='CensusCode', by.y = 'CensusCode')

3.3.6 join property sale price on census tract. In this case, we only take property which tax class is class 1 or class 2 into consideration. As both class 1 and class 2 are residence, and sales most frequently on the market.

CensusTractPropertySale <- st_join(CensusTractDF, PropertySalesDF, join= st_contains)
CensusTractPropertySale <- na.omit(CensusTractPropertySale)

CensusTractPropertySaleGrouped <- CensusTractPropertySale %>%
  mutate(sale_price = as.numeric(sale_price),
         gross_square_feet = as.numeric(gross_square_feet),
    price_per_square = sale_price/gross_square_feet) %>%
  group_by(CensusCode,tax_class_as_of_final_roll) %>%
  summarise(median_sale_price = median(price_per_square)) %>%
  st_drop_geometry() %>%
  select(CensusCode,tax_class_as_of_final_roll,median_sale_price) %>%
  drop_na() %>%
  spread_(key_ = "tax_class_as_of_final_roll",
          value_ = "median_sale_price", 
          convert = TRUE,
          drop = FALSE) %>%
  rename(class1 = '1', class1A = '1A', class1C = '1C', class2 = '2', class2C = '2C', class4 = '4') %>%
  mutate(class1 = as.numeric(replace_na(class1,0)),
         class1A = as.numeric(replace_na(class1A,0)),
         class1C = as.numeric(replace_na(class1C,0)),
         class2 = as.numeric(replace_na(class2,0)),
         class2C = as.numeric(replace_na(class2C,0)),
         class1= max(class1,class1A,class1C),
         class2 = max(class2,class2C)) %>%
  select(CensusCode,class1,class2)

CensusTractDF <- merge(CensusTractDF,CensusTractPropertySaleGrouped,by.x='CensusCode',by.y='CensusCode',all.x = TRUE)

CensusTractDF$ResidencePrice <- CensusTractDF$class1 + CensusTractDF$class2

3.3.7 spatial join felony, street light and trees location to each census tract

FelonyCount <- st_join(CensusTractDF, FelonyData2018, join= st_contains) %>%
  group_by(CensusCode) %>%
  summarise(FelonyAmount = n()) %>%
  st_drop_geometry() 

StreetLightCount <- st_join(CensusTractDF, StreetLight, join= st_contains) %>%
  group_by(CensusCode) %>%
  summarise(StreeLightAmount = n())%>%
  st_drop_geometry() 

TreesCount <- st_join(CensusTractDF, Trees, join= st_contains) %>%
  group_by(CensusCode) %>%
  summarise(TreeAmount = n(),
            TreeDiameterSum = sum(tree_dbh)) %>%
  st_drop_geometry()

CensusTractDF <- merge(CensusTractDF,FelonyCount,by.x='CensusCode',by.y='CensusCode',all.x = TRUE)
CensusTractDF <- merge(CensusTractDF,StreetLightCount,by.x='CensusCode',by.y='CensusCode',all.x = TRUE)
CensusTractDF <- merge(CensusTractDF,TreesCount,by.x='CensusCode',by.y='CensusCode',all.x = TRUE)

CensusTractDF <- CensusTractDF %>%
  mutate(ResidencePrice = ifelse(is.na(ResidencePrice), 0, ResidencePrice),
         FelonyAmount = ifelse(is.na(FelonyAmount), 0, FelonyAmount),
         StreeLightAmount = ifelse(is.na(StreeLightAmount), 0, StreeLightAmount),
         TreeDiameterSum = ifelse(is.na(TreeDiameterSum), 0, TreeDiameterSum),
         TreeAmount = ifelse(is.na(TreeAmount), 0, TreeAmount),
         Salesin3Years = case_when(ResidencePrice > 0 ~ '1',TRUE ~ '0')) %>%
  select(Borough,TotalPop,Men,Women,Hispanic,White,Black,Native,Asian,Citizen,
         Income,Poverty,Unemployment,ResidencePrice,StreeLightAmount,TreeAmount,Salesin3Years,FelonyAmount)

3.3.8 Drop NA value, convert factor variable “Borough” to dummy variable, and standardize independent variables.

CensusTractDF <- na.omit(CensusTractDF)

CensusTractDFNormalized <- CensusTractDF %>%st_drop_geometry() %>%select(-Borough)%>%sapply(as.numeric)
DummyVaribales <- dummy(CensusTractDF$Borough, sep = "")
X <- cbind(CensusTractDFNormalized,DummyVaribales)

AllData <- scale(X[, -c(17)]) %>%
  as.data.frame() %>%
  cbind(CensusTractDF %>% select(FelonyAmount) %>% st_drop_geometry() %>% as.data.frame())
  # rename("Bronx" = "BoroughBronx",
  #        "Brooklyn"= "BoroughBrooklyn",
  #        "Manhattan" = "BoroughManhattan",
  #        "Queens" = "BoroughQueens",
  #        "StatenIsland" = "BoroughStatenIsland")

3.3.9 Exploratory Analysis

In plot 1, we can tell that the distribution of felony amount in each census tract is a rough poisson distribution. Please keep in mind that there is no census tract in New York lucky enough to have zero felony in 2018, the minimum value is 1. Plot 2 shows the felony amount in each borough, and Staten Island is obviously the most safe place in New York city.

From plot 3, we can those red spots, which means felony amount is higher than 500, are more likely to located in Matthan, and only one in Queens.

3.4 Model

3.4.1 Linear Regression, 10-fold corss validation

model <- train(
  FelonyAmount ~ ., 
  AllData,
  method = "lm",
  trControl = trainControl(
    method = "cv", 
    number = 10,
    verboseIter = TRUE
  )
)
## + Fold01: intercept=TRUE 
## - Fold01: intercept=TRUE 
## + Fold02: intercept=TRUE 
## - Fold02: intercept=TRUE 
## + Fold03: intercept=TRUE 
## - Fold03: intercept=TRUE 
## + Fold04: intercept=TRUE 
## - Fold04: intercept=TRUE 
## + Fold05: intercept=TRUE 
## - Fold05: intercept=TRUE 
## + Fold06: intercept=TRUE 
## - Fold06: intercept=TRUE 
## + Fold07: intercept=TRUE 
## - Fold07: intercept=TRUE 
## + Fold08: intercept=TRUE 
## - Fold08: intercept=TRUE 
## + Fold09: intercept=TRUE 
## - Fold09: intercept=TRUE 
## + Fold10: intercept=TRUE 
## - Fold10: intercept=TRUE 
## Aggregating results
## Fitting final model on full training set
print(model)
## Linear Regression 
## 
## 2101 samples
##   21 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1891, 1891, 1892, 1891, 1890, 1890, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE    
##   45.62319  0.4157526  28.5671
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
coef <- model$finalModel$coefficients %>% 
  as.data.frame() %>%
  rownames_to_column("Variables") %>%
  rename(Estimate = ".")

ggplot() +
  geom_bar(data=coef, mapping = aes(x=reorder(Variables,Estimate),y=Estimate),stat="identity") +
  labs(title = "4. Standardized regression coefficients") +
  theme(plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90, hjust = 1)) 

After 10-fold cross validation, the r-square fot this linear regression model is 0.41. White (ratio) has the greatest negative effect on felony amount, and Manhattan has the greatest positive effect. Meanwhile, the final model does not take Staten Island and women (ratio) in to consideration, as those two variables have colinearity with others. A census tract has to located in one of five boroughs, which means only four variables are enough to cover the borough information. And generally, women (ratio) equals to 100% substract men(ratio).

3.5 Classification

In this felony prediction project, the actual number of felony amount is not that important, thus, we can group census tract as “most dangerous”, “dangerous”, “neutral”, “safe”, “most safe”. I applied Jenks Natural Break, a data clustering method designed to determine the best arrangement of values into different classes, to group it. And the classification methos is Decision Tree, also with 10-fold cross validation.

ClassDF <- CensusTractDF %>%
  mutate(FelonyClass = cut(CensusTractDF$FelonyAmount, 
                           breaks = append(0, getJenksBreaks(CensusTractDF$FelonyAmount, 5, subset = NULL))),
         FelonyClass = as.character(FelonyClass)) %>%
  select(-FelonyAmount) %>%
  st_drop_geometry()

ClassDF$FelonyClass <- factor(ClassDF$FelonyClass, levels=c("(0,1]","(1,56]","(56,130]","(130,257]","(257,569]"))

ClassDF <- na.omit(ClassDF)
ggplot(data=ClassDF, aes(x=FelonyClass)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=+1.5, color="white") + 
  theme(plot.title = element_text(hjust = 0.5))  

Census tracts are not evenly distributed in each group, to better model performance, we need to resample data to imbalance it.

ClassDFModel <- rbind(ClassDF,ClassDF[ClassDF$FelonyClass == "(0,1]", ] [rep(seq_len(nrow(ClassDF %>% filter(FelonyClass=="(0,1]"))), each=14),])
ClassDFModel <- rbind(ClassDFModel,ClassDF[ClassDF$FelonyClass == "(56,130]", ] [rep(seq_len(nrow(ClassDF %>% filter(FelonyClass=="(56,130]"))), each=1),])
ClassDFModel <- rbind(ClassDFModel,ClassDF[ClassDF$FelonyClass == "(130,257]", ] [rep(seq_len(nrow(ClassDF %>% filter(FelonyClass=="(130,257]"))), each=5),])
ClassDFModel <- rbind(ClassDFModel,ClassDF[ClassDF$FelonyClass == "(257,569]", ] [rep(seq_len(nrow(ClassDF %>% filter(FelonyClass=="(257,569]"))), each=42),])

ggplot(data=ClassDFModel, aes(x=FelonyClass)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=+1.5, color="white") + 
  theme(plot.title = element_text(hjust = 0.5)) 

ClassDFPlot <- CensusTractDF %>%
  mutate(FelonyClass = cut(CensusTractDF$FelonyAmount, 
                           breaks = append(0, getJenksBreaks(CensusTractDF$FelonyAmount, 5, subset = NULL))),
         FelonyClass = as.character(FelonyClass)) %>%
  select(-FelonyAmount)
ClassDFPlot$FelonyClass <- factor(ClassDFPlot$FelonyClass, levels=c("(0,1]","(1,56]","(56,130]","(130,257]","(257,569]"))
ggplot() +
  geom_sf(data = ClassDFPlot, mapping = aes(fill = FelonyClass), color = 'white', size=0.1) +
  labs(title = "5. Felony Class Map",
       subtitle = "New York; 2018") +
  theme(plot.title = element_text(hjust = 0.5)) +
  # scale_fill_discrete(breaks=c("(0,1]","(1,782]","(782,1.61e+03]","(1.61e+03,2.96e+03]","(2.96e+03,8.66e+03]")) +
  scale_fill_viridis_d() +
  mapTheme()

ClassDFModel <- na.omit(ClassDFModel)

caret.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3)

rpart.cv <- train(FelonyClass ~ ., 
                  data = ClassDFModel,
                  method = "rpart",
                  trControl = caret.control,
                  tuneLength = 15)

rpart.best <- rpart.cv$finalModel
prp(rpart.best, type = 0, extra = 1, under = TRUE)

## [1] "accuracy: 0.626766276700467"

The overall accuray from 10-fold cross validation is 0.62, and the plot above shows the desicion path. Roughly speaking, the priority in the desicion path is the feature importance in this model, which means Women (ratio) is the most important feature in predicting felony amount.

4 Model Prediction

4.1 Linear Regression

CensusTractDF <- na.omit(CensusTractDF)
CensusTractDF$Predict <- predict(model$finalModel, AllData[-c(22)])
CensusTractDF$Residul <- CensusTractDF$FelonyAmount - CensusTractDF$Predict
p1 <- ggplot() +
  geom_sf(data = CensusTractDF, mapping = aes(fill = Predict), color = 'white', size=0.1) +
  scale_fill_gradient(low="#66B904",high="red",limits=c(0,570)) +
  labs(title = "6. Predict Felony Map",
       subtitle = "New York; 2018") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()
p2 <- ggplot() +
  geom_sf(data = CensusTractDF, mapping = aes(fill = FelonyAmount), color = 'white', size=0.1) +
  scale_fill_gradient(low="#66B904",high="red",limits=c(0,570)) +
  labs(title = "7. Actual Felony Map",
       subtitle = "New York; 2018") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()
ggarrange(p1, p2, ncol=2, nrow=1, common.legend = TRUE)

ggplot() +
  geom_sf(data = CensusTractDF, mapping = aes(fill = Residul), color = 'white', size=0.1) +
  scale_fill_viridis_c()+
  labs(title = "8. Residual Map",
       subtitle = "New York; 2018") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()

Linear regression predicted results are more moderate than the actual value, some places in State Island are even predicted as minus felony amount, and those “extreme red” spots on Manhattan and Queens disappeared. On the ohter hand, grids in Manhattan on the predict map are more red than the grids on the actual map. From the plot 8, residual map, it is obviously that most grids have lower actual felony amount. In a nutshell, linear regression failed to predict those extreme high and low value, and shrink the differences among places.

4.2 Classification

CensusTractDF$PredictClass <- predict(rpart.cv,ClassDF %>% select(-FelonyClass), type = "raw")
CensusTractDF$FelonyClass <- ClassDF$FelonyClass
CensusTractDF$PredictClass <- factor(CensusTractDF$PredictClass, levels=c("(0,1]","(1,56]","(56,130]","(130,257]","(257,569]"))
CensusTractDF$FelonyClass <- factor(CensusTractDF$FelonyClass, levels=c("(0,1]","(1,56]","(56,130]","(130,257]","(257,569]"))
p1 <- ggplot() +
  geom_sf(data = CensusTractDF, mapping = aes(fill = PredictClass),color = 'white',size=0.1) +
  scale_fill_discrete(breaks=c("(0,1]","(1,56]","(56,130]","(130,257]","(257,569]")) +
   scale_fill_viridis_d() +
  # scale_fill_manual(values = c("#3F1451", "#2D3E77", "#1F7F79","#4FC050")) +
  labs(title = "9. Predict Felony Class Map",
       subtitle = "New York; 2018") +
  theme(plot.title = element_text(hjust = 0.5),legend.position = "none")+
  mapTheme()
p2 <- ggplot() +
  geom_sf(data = CensusTractDF, mapping = aes(fill = FelonyClass), color = 'white', size=0.1) +
  scale_fill_discrete(breaks=c("(0,1]","(1,56]","(56,130]","(130,257]","(257,569]")) +
  scale_fill_viridis_d() +
  labs(title = "10. Actual Felony Class Map",
       subtitle = "New York; 2018") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()
ggarrange(p1, p2, ncol=2, nrow=1, common.legend = TRUE)

Comparing to linear regression result, desicion tree model is able to extreme high and low values, however, the yellow grids (highest group) in predict map are much more than actual map, easpecially on Manhattan. And all the census tracts on Staten Island are predicted as the lowest felony group (0,1]. In Bronx, Queens, and Brooklyn, the results are more accurate. This situation might be casued by oversample method as this mehod exaggerate the extreme values. Extreme high values are clustered on Manhattan and extreme low values are clustered on Staten Island, and the duplicated data set emphasized the importance of Borough. Thus, more Mahattan census tracts are classified as extreme high felony group, and more Staten Island census tracts are classified as extreme low felony group.

5 Conclusion

Both regression and classification method has a satisfied performance on the felony amount prediction but they have different shorcoming. Linear regression tends to moderate the range, eliminate the geospatial heterogeneity. And classification tends to predict more extreme values. In linear regression, the most important features are Manhattan, Brooklyn, white (ratio), Street Light, Asian (ratio), Bronx, and in decision tree they are Staten Island, Manhattan, Black (ratio), Street Light, and Men. Street light is important in both models which means street condition really impacts the crime activity. Property sale price have little influence on felony amount, crime just equally happend in New York City, no matter rich or poor.

In this project, I choose felony to reflect the severe crime. However, felony not always equals to dangerous. For example, the felony amount is high in wall street area, but it might be financial crime.

In addition, I also tried some ensemble methods in Python like Adaboost Regression, XGboost regression and random forest, but those methods didn’t lead to a higher accuract.